Skip to content

GH-50186: [C++][Gandiva] REPLACE throws "Buffer overflow for output string" for results larger than 64 KB #50187

Open
lriggs wants to merge 2 commits into
apache:mainfrom
lriggs:replaceLimit
Open

GH-50186: [C++][Gandiva] REPLACE throws "Buffer overflow for output string" for results larger than 64 KB #50187
lriggs wants to merge 2 commits into
apache:mainfrom
lriggs:replaceLimit

Conversation

@lriggs

@lriggs lriggs commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Rationale for this change

Gandiva's REPLACE hardcodes a 65535-byte output buffer, throwing Buffer overflow for output string whenever the result exceeds 64 KB. The cap is arbitrary: Gandiva's variable-length output column already grows dynamically and is only bounded by the int32 offset width (~2 GB). Real queries that replace into large concatenated/aggregated strings fail unnecessarily.

What changes are included in this PR

replace_utf8_utf8_utf8 now sizes the output buffer to the exact result instead of using a fixed cap. The output length of a replace is deterministic:

out_len = text_len + num_matches * (to_str_len - from_str_len)
The wrapper does a single counting pass over the input to find the number of non-overlapping matches of from_str (mirroring the match loop already used in the implementation), computes the exact size in gdv_int64 to avoid intermediate overflow, and passes that as max_length.

The internal replace_with_max_len_utf8_utf8_utf8 is unchanged — its bounds checks now act purely as a correctness backstop (they should never fire with an exact bound), and its explicit-max-length signature remains for the existing unit tests.
When to is shorter than from, the result shrinks and max_length <= text_len, so the shrinking path is sized correctly too.

Are these changes tested?

Yes. Added regression cases to TestStringOps.TestReplace in string_ops_test.cc:

A 35000-char 'X' input with X → XY, producing a 70000-byte result (previously overflowed at 65535) — asserts no error and exact length/content.
A 70000-char shrinking case (XX → X) to cover the shrink path on a >64 KB input.
Full precompiled suite passes locally (132/132), including the existing explicit-max_len overflow tests, which call the internal function directly and are unaffected.

Are there any user-facing changes?

REPLACE now succeeds on results larger than 64 KB instead of erroring. No API or signature changes.

@github-actions

Copy link
Copy Markdown

⚠️ GitHub issue #50186 has been automatically assigned in GitHub to PR creator.

Gandiva's REPLACE hardcoded a 65535-byte output cap, throwing
"Buffer overflow for output string" whenever the result exceeded 64 KB.
Size the output buffer to the exact result instead, by counting
non-overlapping matches of from_str: text_len + num_matches *
(to_str_len - from_str_len). Removes the arbitrary cap; the internal
replace_with_max_len variant and its bounds checks are unchanged.

Gandiva variable-length output uses int32 offsets, so a single output
string cannot exceed INT_MAX (2 GB). Guard that boundary explicitly with
a clear error message instead of letting the int32 size cast wrap
silently (which could otherwise lead to under-allocation).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
from_str_len, to_str, to_str_len, 65535,
out_len);
from_str_len, to_str, to_str_len,
static_cast<gdv_int32>(max_length), out_len);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we even need this argument in the function signature?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thats how the function know how much memory to allocate for the output string.

EXPECT_EQ(std::string(out_str, out_len), "TestString");
EXPECT_FALSE(ctx.has_error());

// Large output (>64 KB) must not overflow: buffer is sized to the exact result.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd feel more comfortable with few specific edge cases but I would not block the PR. Examples:

  • INT_MAX resulting size
  • 0 resulting size

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested one byte past the limit (INT_MAX + 1) rather than exactly INT_MAX, because a result of exactly INT_MAX is allowed and would force a real ~2 GB arena allocation — not something to do in a unit test. The boundary test confirms the guard triggers at precisely the right point.

gdv_int32 to_str_len, gdv_int32* out_len) {
// Count non-overlapping matches to size the output buffer exactly, so large
// results are not capped by an arbitrary limit.
gdv_int64 num_matches = 0;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We just reviewed base64 encoding few weeks back and didn't accept double pass variant. See if you can reduce the logic back to single pass or provide benchmarks for single-vs-double variants (0/few/lots replacements) as perhaps resizing is more costly.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

old vs new is measured without a separate build: the old wrapper was literally replace_with_max_len_utf8_utf8_utf8(…, 65535, …), and that inner function is unchanged. So new = replace_utf8_utf8_utf8 (counting scan + exact alloc) and old = replace_with_max_len_utf8_utf8_utf8 given an adequately-sized buffer (the pre-change algorithm, no scan). The delta is purely the O(n) scan.

Results

case text_len matches new ns/call old ns/call delta
small/dense expand a->ab 256 256 3013.5 2250.3
small/sparse expand a->ab 256 4 1738.6 922.1
medium/dense expand a->ab 65536 65536 753132.4 573520.9
medium/sparse expand a->ab 65536 1024 436435.6 224018.2
large/dense expand a->ab 4194304 4194304 52924267.0 40737654.2
large/sparse expand a->ab 4194304 65536 28408644.3 14630748.5
large/dense shrink ab->a 4194304 0 12771361.1 12659342.5

Interpretation:

The scan's relative cost is consistent across sizes (~30% dense, ~90% sparse), as expected for an O(n) pass — it scales with input length, not match count.
Sparse is the worst case (+90%): with few matches, the old replace pass is mostly cheap byte-copying, so adding a full second comparison pass nearly doubles the work. Dense (+30%): the replace pass already does expansion-copying, so the scan is a smaller fraction.
Shrink path: +0.9% — confirms the refinement works; when to ≤ from the scan is skipped and there's no measurable overhead.

@github-actions github-actions Bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jun 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants